Abstract

Place recognition is one of the most challenging problems in computer vision, and has become a key part of mobile robotics and autonomous driving applications for performing loop closure in visual SLAM systems. Moreover, the difficulty of recognizing a revisited location increases with appearance changes caused, for instance, by weather or illumination variations, which hinders the long-term application of such algorithms in real environments. In this paper we present a convolutional neural network (CNN), trained for the first time with the purpose of recognizing revisited locations under severe appearance changes, which maps images to a low-dimensional space where Euclidean distances represent place dissimilarity. In order for the network to learn the desired invariances, we train it with triplets of images selected from datasets which present a challenging variability in visual appearance. The triplets are selected in such a way that two samples are from the same location and the third one is taken from a different place. We validate our system through extensive experimentation, where we demonstrate better performance than state-of-the-art algorithms on a number of popular datasets.

1. Introduction 

The process of identifying images that belong to the same location, usually known as place recognition, is still an open problem in computer vision. Place recognition is a key part of mobile robotics and autonomous driving applications, such as vision-based simultaneous localization and mapping (SLAM) systems, where revisiting a location introduces important information which can be employed in the tasks of localization [20] and loop closure [15]. It can also be applied in augmented reality applications, where the user obtains information about important places, monuments or texts from a single image taken with a smartphone camera. The difficulties induced by changes in the scenario, viewpoint, illumination or weather conditions make place recognition a much more difficult task than one may intuitively think (see Figure 1). Traditionally, place recognition has focused on scenarios without major appearance changes. In that context, most methods employ bags of visual words inspired by [24] and [?]. Bag-of-words (BoW) approaches have proven to work quickly and effectively in static scenes, but they have several drawbacks. They usually rely on traditional keypoint descriptors, such as SIFT [11], SURF [1], or BRIEF [3], which describe the local appearance of individual patches, limiting their descriptive power with respect to whole-image methods, as observed by [?]. Their performance in challenging environments strongly depends on the invariance of those descriptors to perceptual changes. Convolutional neural networks (CNNs) are gaining importance in most classification tasks [?]. When used as generic feature generators, they often outperform the state-of-the-art algorithms even for tasks other than classification [?]. However, their use in place recognition is limited to the exploitation of generic features extracted from the internal layers of pre-trained CNNs [?, 25].

Figure 1. Frames extracted from the Nordland dataset [26] that belong to the same place in winter, spring, summer and fall. The proposed method is capable of recognizing the same location under challenging appearance changes.

In this paper, we propose a novel approach to place recognition capable of detecting revisited places under extreme changes in weather, illumination, or external conditions. In contrast to previous algorithms which rely on local visual descriptors, our algorithm works with the complete image, reducing unnecessary errors induced by subsequent feature matching processes by providing a better estimate of place similarity. For that purpose, we have trained a CNN for the task of recognizing revisited places. To the best of our knowledge, this is the first CNN specifically trained to perform place recognition, as opposed to using generic features extracted from networks trained for other tasks. We demonstrate that place recognition can be better resolved by discriminatively training a network for such a problem, since visual cues that are relevant for object classification may not be optimal for place recognition. Moreover, we claim that place recognition can be performed with a smaller network than those employed for object recognition. We contribute to the state of the art with a CNN:

• Capable of recognizing revisited places under challenging appearance changes of the scene, including seasonal, time-of-day and outdoor/indoor changes.

• Suitable for any long-term, real-time place recognition tasks, which are often necessary in mobile robotics and autonomous navigation.

We demonstrate these claims with extensive experimentation on several challenging datasets, where we compare our proposal with two state-of-the-art algorithms: DBoW2 [14], and a generic network as in [25]. Experiments show that, on datasets where appearance changes are severe, our method recognizes previously visited locations with a higher rate of success than the state-of-the-art algorithms, and with a lower computational burden than previous CNN-based methods.

2. Related Work 

As mentioned above, visual place recognition has been an object of research in the field of SLAM, often as a key part of the localization and loop closing modules. One of the first SLAM techniques which introduced BoW in this context was FAB-MAP [?], where a probabilistic approach to place recognition based on the local appearance of each location was proposed. The authors also deal with perceptual aliasing in the environment by introducing a generative model which implements some logical reasoning to discard false positives caused by this phenomenon. However, the use of SURF features and of the generative model increases the computational burden. This was tackled in [?] with DBoW2, which introduced for the first time the use of bags of binary words obtained from BRIEF descriptors, reducing by more than an order of magnitude the time employed in the feature extraction process. The use of BRIEF, which is not rotation or scale invariant, limits the recognition task to scenes taken from the same viewpoint in planar trajectories. An improved version of this algorithm has been recently published in [14], where the authors build an urban dictionary based on ORB [?], which yields better performance on popular datasets.

A common problem of the previous techniques is their poor behavior in place recognition under different illumination conditions and in poorly textured environments, as well as their limited invariance to scale and viewpoint. In [?], the authors deal with this by building a vocabulary tree that employs straight lines in combination with the MSLD descriptor [?], which increases the robustness against changes in weather conditions. However, the evaluation sequences do not include strong perceptual changes, thus the system may not be suitable for long-term operation in changing environments. This problem was tackled by Neubert et al. in [?], where they propose a place recognition algorithm capable of working across seasons. They argue that seasonal changes in the scene are predictable, and propose a superpixel-based algorithm (SP-APC) which is able to predict those changes and then recognize the scene, with a prediction process based on a dictionary that learns from training data how the appearance of the scene changes over the year. However, the algorithm is only tested with the Nordland dataset [26], which shows extreme seasonal changes, and hence it will not predict gradual changes in the environment.

A different strategy works on local sequences instead of estimating the best single location, with the proposal of Milford and Wyeth as one of the most relevant contributions [?]. They propose SeqSLAM, a post-processing technique that recognizes sequences of previously visited locations under challenging perceptual changes. Their approach estimates the best match by taking into account not only the single location, but also by imposing coherence with the surrounding sequence. Under this assumption, they obtain good performance by only applying a local contrast enhancement to the input images (downsampled from the original datasets), and then comparing the normalized images by computing the sum of absolute differences (SAD) between them. However, this procedure has several drawbacks. It only works with local and consistent sequences, which makes it impractical for applications that work with isolated images. It may also fail with large changes in viewpoint and rotation, and it suffers from image aliasing, since its viewpoint invariance is only due to extreme downscaling of the input images.

Recently, another group of techniques has emerged with promising results, motivated by the outstanding performance achieved by CNNs as generic feature generators in several classification tasks [?]. In this context, a recent work is [25], where the authors employ a pre-trained network named OverFeat [23], which was the winner of the localization task of the ImageNet Large Scale Visual Recognition Challenge 2013 [22]. They study the use of the intermediate representations learned by the CNN as image features valuable for place recognition even under challenging appearance changes, with promising results.

Figure 2. Architecture of the proposed network. The convolution and pooling stages are indicated at the top of the figure, and the sizes of the resulting data are shown at the bottom. N is a local contrast normalization operation acting across channels as applied in [9].

3. Methodology 

To solve the task of detecting whether an image belongs to a previously visited place, we propose to train a convolutional neural network to embed images in a low-dimensional space where Euclidean distance represents location dissimilarity. Our solution is inspired by works in content-based image retrieval; however, our network is trained to produce a feature vector invariant to drastic appearance changes in the scene, such as seasonal changes. To achieve this, we train the network using labeled datasets which present the same locations under different illumination, point of view or weather conditions. We apply a training technique similar to [28], where the network is presented with triplets of images, formed by a query image $\mathbf{x}_i$, an image from the same location, $\mathbf{x}_j$, and an image of a different location, $\mathbf{x}_k$. In the following we describe the architecture and the training process of the proposed network.

3.1. Architecture of the CNN 

In view of the difficulties in training a convolutional neural network from scratch using a relatively small specialized dataset, we take the approach of modifying a pre-trained network. In particular, we resort to the reference CaffeNet network [8], which mirrors the architecture of Krizhevsky et al. [?], from which we only keep the first four convolutional layers, replacing the rest with a single fully connected layer which is our descriptor output (see Figure 2). Since we discard all the fully connected layers, we are not constrained to the original input size of 227 × 227 pixels and instead work with a smaller input of 160 × 120.
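As a rough check of what this truncation implies for feature sizes, the sketch below propagates an input resolution through the first four CaffeNet stages. It is only an illustration under stated assumptions: the kernel, stride and padding values are taken from the publicly distributed CaffeNet definition rather than from this paper, the rounding rules (convolution rounds down, pooling rounds up) are those of Caffe, and the names `LAYERS`, `out_size` and `conv4_shape` are hypothetical.

```python
import math

# (name, kind, kernel, stride, pad, output channels) for the first four
# CaffeNet stages. Assumption: values from the public CaffeNet definition,
# not from the paper. The normalization layers do not change the size.
LAYERS = [
    ("conv1", "conv", 11, 4, 0, 96),
    ("pool1", "pool", 3, 2, 0, None),
    ("conv2", "conv", 5, 1, 2, 256),
    ("pool2", "pool", 3, 2, 0, None),
    ("conv3", "conv", 3, 1, 1, 384),
    ("conv4", "conv", 3, 1, 1, 384),
]


def out_size(size, kind, kernel, stride, pad):
    """Output length along one spatial axis for a single layer."""
    span = size + 2 * pad - kernel
    if kind == "conv":
        return span // stride + 1        # Caffe convolution rounds down
    return math.ceil(span / stride) + 1  # Caffe pooling rounds up


def conv4_shape(width, height):
    """Return (width, height, channels) of the conv4 output."""
    channels = 3
    for _name, kind, kernel, stride, pad, out_channels in LAYERS:
        width = out_size(width, kind, kernel, stride, pad)
        height = out_size(height, kind, kernel, stride, pad)
        if out_channels is not None:
            channels = out_channels
    return width, height, channels


if __name__ == "__main__":
    for w, h in [(227, 227), (160, 120)]:
        shape = conv4_shape(w, h)
        print((w, h), "->", shape, "=", shape[0] * shape[1] * shape[2], "values")
```

Under these assumptions, a 227 × 227 input gives a 13 × 13 × 384 conv4 output (64,896 values, in line with the 64k-dimensional CaffeNet feature mentioned in Section 4), while a 160 × 120 input gives 9 × 7 × 384 values feeding the new fully connected layer.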

3.2. Description of the Cost Function 

In a nutshell, the network maps an $M \times N \times C$ input image to a descriptor vector of length $D$, which corresponds to the activations of the output layer of the CNN, i.e.:

$$h : \mathbb{R}^{M \times N \times C} \to \mathbb{R}^{D}, \qquad \mathbf{x} \mapsto h(\mathbf{x}) \tag{1}$$

where $h(\mathbf{x})$ is the descriptor of the image $\mathbf{x}$, whose Euclidean distances to other descriptors must be representative of location dissimilarity. In order to achieve this behavior, the network parameters are obtained by minimizing the following objective function:

$$\mathbf{w}^{*} = \arg\min_{\mathbf{w}} \left\{ \mathcal{C} + \lambda \lVert \mathbf{w} \rVert_2 \right\} \tag{2}$$

where the second term represents a regularization over the parameters of the network $\mathbf{w}$, and the first term $\mathcal{C}$ is the sum of the cost functions over all the triplets, which can be expressed as

$$\mathcal{C} = \sum_{(\mathbf{x}_i, \mathbf{x}_j, \mathbf{x}_k) \in \mathcal{T}} C(\mathbf{x}_i, \mathbf{x}_j, \mathbf{x}_k) \tag{3}$$

with $C$ being the cost function for each triplet of images and $\mathcal{T}$ the set of training triplets. The cost function employed is similar to that in [28], and can be expressed as:

$$C(\mathbf{x}_i, \mathbf{x}_j, \mathbf{x}_k) = \max\left\{ 0,\; 1 - \frac{\lVert h(\mathbf{x}_i) - h(\mathbf{x}_k) \rVert_2}{\beta + \lVert h(\mathbf{x}_i) - h(\mathbf{x}_j) \rVert_2} \right\} \tag{4}$$

This cost function is satisfied, producing zero cost, when the distance of the dissimilar pair is larger than the distance of the similar pair by at least a margin $\beta$. This means that dissimilar descriptors will not continue to be separated indefinitely in the descriptor space during training. In contrast, triplets not satisfying this condition will produce costs that the training process will aim to reduce by updating the weights of the CNN accordingly.
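The behavior of the per-triplet cost can be checked numerically with a minimal sketch such as the one below. It assumes the ratio form of the cost, with the dissimilar-pair distance measured from the query descriptor, and works on plain Python lists standing in for descriptors; the function names are illustrative and this is not the modified Caffe layer used for training.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two descriptors given as sequences."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def triplet_cost(h_query, h_similar, h_different, beta=1.0):
    """Per-triplet cost: zero once d_different >= beta + d_similar."""
    d_similar = euclidean(h_query, h_similar)
    d_different = euclidean(h_query, h_different)
    return max(0.0, 1.0 - d_different / (beta + d_similar))


if __name__ == "__main__":
    # Dissimilar pair far away, similar pair coincident: margin satisfied.
    print(triplet_cost([0.0, 0.0], [0.0, 0.0], [3.0, 4.0]))
    # Both pairs at distance 1: margin violated, positive cost.
    print(triplet_cost([0.0, 0.0], [1.0, 0.0], [1.0, 0.0]))
```

The cost saturates at zero as soon as the margin is met, so well-separated triplets stop contributing to the gradient, and it never exceeds 1 because both distances are non-negative.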


3.3. Training the CNN 

To achieve the desired invariances in the representation produced by the network, triplets must be chosen so as to provide relevant visual cues (see Figure 3 for an example). We train the network using a mixture of triplets from several datasets, which are detailed in the following sections, to improve invariance to lighting, weather and point of view changes. The network is trained using the Caffe library [?], modified to include the previously described cost function.

(a) Query image (b) Similar image (c) Different image

Figure 3. Training triplet extracted from the KITTI dataset [?], where large viewpoint variations are present.

As previously explained, the weights of the four convolutional layers are fine-tuned from the CaffeNet reference network, an implementation of [?], whereas the final fully connected layer is new. We scale the learning rate of the pre-trained layers by a factor of 1/1000 and fix the global learning rate at 0.001. The margin $\beta$ is set to 1 and the regularization constant $\lambda$ to 0.0005. We train for 40,000 iterations, for a total of 1.2 million triplets, using portions of the KITTI, Nordland, and Alderley datasets, which are described in the following sections.

3.3.1 KITTI Dataset

The odometry benchmark from the KITTI dataset [?] comprises 11 training sequences with accurate ground truth of the trajectory, and 10 test sequences without ground truth for evaluation. Both the training and the test sequences are stereo frames extracted from urban environments in daylight conditions. We select triplets in order to increase the robustness of the network to changes in viewpoint by choosing the similar pair from a wide variety of relative poses. We also check that the dissimilar pairs do not belong to the same place by employing the ground truth location (since loop closures exist in the sequences). Figure 3 depicts a triplet extracted from the KITTI dataset.

3.3.2 Alderley Dataset

We have also trained the network with the Alderley dataset [?], which contains severe changes in illumination and weather conditions. This dataset is formed by two sequences of 8 km along the suburb of Alderley in Brisbane (Australia). The first one was recorded during a clear morning, while the second one was collected on a stormy night with low visibility (see Figure 4). In order to achieve robustness to the aforementioned changes, during training we provide the network with challenging triplets that combine images from both sequences (we have used the first 10k frames from the day sequence and their matches from the night sequence for training, while reserving the rest for experimentation).

3.3.3 Nordland Dataset 

The Nordland dataset [26], extracted from the TV documentary “Nordlandsbanen - Minutt for Minutt” produced by the Norwegian Broadcasting Corporation NRK, consists of a 728 km long train journey connecting the cities of Trondheim and Bodø in Norway. The sequence was recorded once in each season, and hence it contains challenging appearance changes, as Figure 1 shows. Additionally, it provides different weather conditions due to the length of the dataset (each sequence is approximately 10 hours long). We generate triplets by providing two images from the same place in different seasons, and an image from another location in any season (we check that frames are actually from different places using the included GPS ground truth).
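The triplet selection described for this dataset can be sketched as follows. This is a simplified illustration, not the sampling code used for training: it assumes the four seasonal sequences are frame-synchronized, so that equal frame indices show the same place, and it replaces the GPS check with a hypothetical minimum frame gap `min_gap`. The season labels and the function name are likewise illustrative.

```python
import random

SEASONS = ("spring", "summer", "fall", "winter")


def sample_triplet(num_frames, min_gap, rng):
    """Return (query, similar, different), each a (season, frame_index) pair."""
    frame = rng.randrange(num_frames)
    # Same place seen in two different seasons.
    query_season, similar_season = rng.sample(SEASONS, 2)
    # A different place: at least min_gap frames away from the query, any season.
    # Assumption: a frame gap stands in for the GPS-based distance check.
    far_frames = [f for f in range(num_frames) if abs(f - frame) >= min_gap]
    other = rng.choice(far_frames)
    return (query_season, frame), (similar_season, frame), (rng.choice(SEASONS), other)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed only to make the demo repeatable
    for _ in range(3):
        print(sample_triplet(num_frames=1000, min_gap=100, rng=rng))
```

With real data, the frame-gap test would be replaced by a distance threshold on the GPS positions of the two frames, which also handles segments where the train is stopped.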

4. Experimental Evaluation 

In order to validate the proposed network, we perform a series of experiments where we compare the behavior of our system with two state-of-the-art techniques in place recognition: DBoW2 [14] and a feature vector extracted from an internal layer of a neural network trained for object classification as in [25]. The actual implementations used are the official distribution of ORB-SLAM [14], and the CaffeNet [8] implementation of [?], which we simply name CaffeNet in this work. The resolutions of the input images are 160 × 120 in our proposal, 227 × 227 in CaffeNet, and the native resolution of each dataset in DBoW2. In the following, we first describe the methodology employed for the comparison, then we present a number of experiments with datasets from several environments, under different appearance changes. Finally, we also compare the computational cost of the algorithms and their feasibility for place recognition tasks, such as loop closure modules in visual SLAM algorithms.

(b) Stormy-night sequence

Figure 4. Frames extracted from the Alderley dataset [?], where drastic illumination changes are present.

4.1. On Comparing Confusion Matrices 

The key element of a place recognition system is the estimation of the similarity between the compared images. For that purpose, we calculate a descriptor $h(\mathbf{x}_i)$ for each input image $\mathbf{x}_i$, and then we estimate the similarity with other images by comparing the Euclidean distances between their descriptors. A common measurement widely employed in place recognition collects each distance (or score) in a confusion matrix, where the rows and the columns correspond to the database and the query sequence, respectively, that is, $\mathcal{M}(i,j) = \lVert h(\mathbf{x}_i) - h(\mathbf{x}_j) \rVert_2$. In our case, a normalized confusion matrix $\mathcal{M}^{*}$ can be defined as follows:

$$\mathcal{M}^{*}(i,j) = \frac{\mathcal{M}(i,j)}{\max\{\mathcal{M}(i,j)\}} \tag{5}$$

whose terms are the normalized Euclidean distances between the descriptors associated with images $i$ and $j$ of each sequence. For the methods with which we compare our proposal, the confusion matrices include the normalized scores proposed by each methodology, which are:

• DBoW2 [?]: the proposed score is already normalized, but their approach associates high scores with similar images, thus we compute the complementary matrix before the comparison.

• CaffeNet [25]: we extract the outputs of the convolutional layer conv4, which present the best results for the tested datasets, and compare them using the Euclidean distance as they propose.
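The construction of the normalized confusion matrix, and the complement applied to similarity scores, can be sketched in a few lines. This is a minimal illustration on plain Python lists, assuming descriptors are already computed; the function names are hypothetical and `math.dist` requires Python 3.8 or later.

```python
import math


def normalized_confusion_matrix(db_descriptors, query_descriptors):
    """Rows index database images, columns index query images."""
    matrix = [[math.dist(d, q) for q in query_descriptors] for d in db_descriptors]
    # Divide by the largest entry; fall back to 1.0 if every distance is zero.
    peak = max(max(row) for row in matrix) or 1.0
    return [[value / peak for value in row] for row in matrix]


def complement(similarity_matrix):
    """Turn normalized similarity scores (high = similar) into dissimilarities."""
    return [[1.0 - score for score in row] for row in similarity_matrix]


if __name__ == "__main__":
    database = [[0.0, 0.0], [3.0, 4.0]]
    query = [[0.0, 0.0], [3.0, 4.0]]
    print(normalized_confusion_matrix(database, query))
    print(complement([[1.0, 0.25]]))
```

After these two steps every method yields a matrix in [0, 1] where low values mean similar images, which is what makes the raw matrices directly comparable.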


Place recognition methods for loop closure generally employ post-processing techniques to find good matches which actually represent the same location in the confusion matrix, usually by looking for sequences of similar frames [?, ?]. Any method that generates a confusion matrix can benefit from such post-processing techniques, including ours. For this reason, we perform our experimental comparisons on the “raw” confusion matrix. A problem of using confusion matrices to compare the performance of different methods is that it is quite difficult to establish an indicator of the quality of a confusion matrix, since there is no ground truth measurement of the place similarity between any two images. To overcome this issue, we perform a comparison based on synchronized sequences which do not present any loop closures, since in those cases the ground truth pair is placed on the diagonal of the confusion matrix. In order to generate a quality measurement of a confusion matrix, we start by only keeping its k smallest values. Then we plot the ratio of points that fall within the diagonal with respect to d, which is defined as the maximum distance to the diagonal to consider a point as an inlier (see Figure 7).
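The quality measurement can be written down as a short routine. The sketch below assumes that the k smallest values are kept per query image (as the binarized matrices of Figure 6 suggest), that the two sequences are synchronized so the true match of query j is database frame j, and that d is measured in frames; the function name is illustrative.

```python
def inlier_ratio(matrix, k, d):
    """Fraction of the k best matches per query lying within d of the diagonal.

    matrix[i][j] is the distance between database image i and query image j.
    """
    num_db = len(matrix)
    num_query = len(matrix[0])
    inliers = 0
    total = 0
    for j in range(num_query):
        # Keep the k database frames with the smallest distance to query j.
        column = sorted((matrix[i][j], i) for i in range(num_db))
        best = column[:k]
        total += len(best)
        inliers += sum(1 for _distance, i in best if abs(i - j) <= d)
    return inliers / total


if __name__ == "__main__":
    n = 5
    # Toy matrix: perfect matches on the diagonal, maximal distance elsewhere.
    toy = [[0.0 if i == j else 1.0 for j in range(n)] for i in range(n)]
    for width in range(3):
        print(width, inlier_ratio(toy, k=2, d=width))
```

Sweeping d and plotting the returned ratio gives one performance curve per method and per value of k.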

4.2. KITTI Dataset 

First, we compare the performance of the state-of-the-art algorithms with our proposal by processing the test sequences from the KITTI dataset [?], which has a resolution of 1241 × 376. Figure 5 depicts the confusion matrices obtained with the sequence KITTI-11 by comparing images from the left and right cameras, where we can observe a good performance of all methods. We also notice that both DBoW2 and CaffeNet present a thin diagonal, and they do not show any good matches outside the diagonal. In contrast, the confusion matrix obtained with our approach presents a thicker diagonal, and also zones with low values which correspond to parts of the sequence where the car is either stopped or moving at low speed. This suggests that our approach is more robust to changes in point of view and, hence, is a more versatile option for place recognition tasks which may not require the camera to be in the exact same location. Nevertheless, it is quite difficult to extract quantitative conclusions from the observation of these charts. Instead, Figure 6 depicts the 10 best matches for each input image. While both CaffeNet and our approach exhibit a good performance, with low dispersion around the diagonal, DBoW2 presents a considerable number of outliers during the whole sequence. This is quantified in Figure 7, where we can observe that both CNN-based methods yield better results than DBoW2, while we observe a slightly superior performance of CaffeNet over our approach. However, it is worth considering that the features extracted from CaffeNet are 64k-dimensional, whereas ours are much smaller, with 128 elements. This is of importance for sustainable long-running place recognition, as will be discussed in Section 4.6.

4.3. Malaga Urban Dataset 

We also evaluate the performance of the techniques with the Malaga Urban Dataset [2], which contains frames obtained from a stereo camera, with a resolution of 1024 × 768, and data acquired from five laser scanners during a 37 km sequence in Malaga (Spain) with cloudy weather and direct sunlight in several parts of the sequence. As can be observed in Figure 8, the urban structure presented by this dataset is quite different from the one in the KITTI sequences, which makes it a challenging environment since none of the methods have been trained with this dataset. Figure 9 depicts the performance of the three compared methods when tested on the Malaga-10 sequence (we employ the left sequence as database and the right one as query). While the CNN-based methods perform well, with a small superiority of CaffeNet, DBoW2 behaves poorly, with a high ratio of outliers. This indicates that both CNN-based approaches are capable of recognizing images from multiple environments (even when they have not been trained with similar images), which makes them an interesting choice for lifelong applications, while the DBoW2 approach lacks this capability.

(a) Our approach (b) DBoW2 (c) CaffeNet

Figure 5. Confusion matrices belonging to our approach (a), DBoW2 (b), and CaffeNet (c) in the KITTI-11 sequence, with the left camera frames as database and the right ones as query. Red tones indicate high values (dissimilar images), and blue indicates low values (similar pairs). We observe a good performance of both CNN-based methods, with the difference that our method produces low distances for images with considerable changes in viewpoint, thus generating more correspondences with medium-valued Euclidean distances. In contrast, DBoW2 presents a rigid behavior, since it is unable to detect matches with large variations in viewpoint.

(a) Our approach (b) DBoW2 (c) CaffeNet

Figure 6. Binarized confusion matrices comparing the left and right images from the KITTI-11 sequence, including the k = 10 minimum values for each query image, corresponding to our approach (a), DBoW2 (b), and CaffeNet (c). The performance of the three methods is good in general, with a well-defined diagonal, but DBoW2 has a larger number of outliers.

4.4. Nordland Dataset 

As mentioned above, the Nordland dataset [26] includes sequences with 1920 × 1080 resolution from the same perspective during the four seasons of the year, which leads to severe changes in the appearance of the environment. For these experiments, we have employed the last hour of the dataset, which was not used for training, and removed the segments which include either tunnels or stations. Figure 10 depicts the performance curves of the three approaches when comparing the most challenging sequence pair, summer and winter (other seasonal combinations yield similar results). We observe the better performance of our proposal against CaffeNet, which presents considerably fewer inliers than our approach for all the diagonal widths, for both k = 5 and k = 10. This is logical, since neither CaffeNet nor DBoW2 has been trained with the purpose of being robust to those appearance changes.

Figure 7. Performance curves of our approach, DBoW2 and CaffeNet in the KITTI-11 sequence, when comparing the left and the right sequences.

Figure 8. Frame extracted from the Malaga Urban Dataset [2].

4.5. Alderley Dataset 

We have also tested the robustness of the three methods to challenging changes in weather and lighting conditions, by processing the last 5k frames from the day sequence and their matches from the night sequence of the Alderley dataset (which has a resolution of 640 × 260). Figure 11 shows that our proposal outperforms CaffeNet and DBoW2, with a better ratio of inliers against diagonal width in all cases. However, it can be noticed that a low ratio is obtained by all three approaches, since it is a highly challenging dataset. Hence, the use of a post-processing technique based on sequentiality would be unavoidable to obtain a system with a reasonable performance in similar scenarios.

Figure 9. Performance curves of the three methods in the Malaga-10 sequence, with the left and the right images as database and query inputs, respectively.

4.6. Performance 

Finally, we examine the computational performance in several aspects, which are presented in Table 1. Our tests run on an Intel Core i7-3770, while our GPU tests also rely on an NVIDIA GeForce GTX 790. First, we measure the time required to process a single image. In both CNN-based methods, the value includes loading the image and performing a forward pass to obtain the feature vector. In the case of DBoW2 [?], we measure the time required to compute the bag-of-words histogram. Since the input image resolution for DBoW2 varies with the dataset, we have included the minimum and maximum average times from all the sequences. The results indicate that DBoW2 is less demanding than both CNN-based methods and that ours is roughly three times faster than the reference CaffeNet network (about 2.6 times on CPU and 3 times on GPU). We then measure the size of the descriptor, which is relevant since the computational cost of calculating the confusion matrix (which is required for any loop closure system) increases with it. The length of the word histogram of DBoW2 is variable in the official implementation, and can be as long as the dictionary size (32k elements). In our experiments, the length of the histogram varied from 200 to 500 elements on average, increasing in datasets where it performs well. On this matter, our method clearly outperforms both CaffeNet and DBoW2 with a smaller, fixed-length descriptor of 128 elements.

Table 1. Performance comparison between DBoW2, CaffeNet, and our approach. We compare the average processing times (on both CPU and GPU), and the descriptor lengths.

Value               DBoW2      CaffeNet   Ours
CPU Time (ms)       4-22       1450       550
GPU Time (ms)       n/a        30         10
Descriptor length   200-500    64k        128

Figure 10. Performance curves on a subset of the Nordland dataset, when using the summer sequence as database and the winter sequence as query inputs.

Figure 11. Performance curves of the three methods in the final part of the Alderley dataset. We have employed the day sequence as database, and the challenging night sequence as query.
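To give a sense of scale for the descriptor lengths, the sketch below compares storage and pairwise comparison cost for the two CNN descriptors. It rests on assumptions not stated in the text: descriptors are stored as 32-bit floats, the 64k-dimensional CaffeNet feature is taken to be exactly 13 × 13 × 384 = 64,896 values, and the cost of a distance is counted as one operation per dimension. The function names and the 10,000-image example are illustrative.

```python
BYTES_PER_FLOAT32 = 4
OURS_LENGTH = 128
CAFFENET_LENGTH = 13 * 13 * 384  # assumption: exact size behind "64k"


def descriptor_bytes(num_images, length):
    """Memory needed to store one float32 descriptor per image."""
    return num_images * length * BYTES_PER_FLOAT32


def pairwise_operations(num_db, num_query, length):
    """Per-dimension operations needed to fill the full confusion matrix."""
    return num_db * num_query * length


if __name__ == "__main__":
    images = 10_000
    for name, length in [("ours", OURS_LENGTH), ("CaffeNet conv4", CAFFENET_LENGTH)]:
        megabytes = descriptor_bytes(images, length) / 1e6
        operations = pairwise_operations(images, images, length)
        print(f"{name}: {megabytes:.1f} MB, {operations:.3e} operations")
```

Under these assumptions both storage and matching cost scale linearly with descriptor length, so the 128-element descriptor is about 507 times cheaper than the conv4 feature on both counts.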

5. Conclusions 

We have trained a convolutional neural network to perform place recognition under heavy appearance changes due to weather, seasons and perspective. The network embeds images in a 128-dimensional space where samples from similar locations are separated by small Euclidean distances. The network was trained using triplets of images from datasets where weather, lighting and point of view changes were present, in order to allow the network to learn invariances to these changes. The proposed network outperforms the state-of-the-art methods for place recognition on several challenging datasets, providing superior robustness to changes in viewpoint and weather conditions. The small size of the resulting vector makes our system suitable for applications where long-term operation is required.